Confidence intervals for two sample means: Calculation, interpretation, and a few simple rules

Valued by statisticians, enforced by editors, and confused by many authors, standard errors (SEs) and confidence intervals (CIs) remain a controversial issue in the psychological literature. This is especially true for the proper use of CIs for within-subjects designs, even though several recent publications elaborated on possible solutions for this case. The present paper presents a short and straightforward introduction to the basic principles of CI construction, in an attempt to encourage students and researchers in cognitive psychology to use CIs in their reports and presentations. Focusing on a simple but prevalent case of statistical inference, the comparison of two sample means, we describe possible CIs for between- and within-subjects designs. In addition, we give hands-on examples of how to compute these CIs and discuss their relation to classical t-tests.

analysis (the comparison of two means from independent groups as well as from dependent groups/conditions) and describe appropriate CIs in an intuitive framework. In this framework, we suggest that a much simpler approach to within-subjects CIs than the one suggested in often-referenced papers (e.g., Loftus & Masson, 1994) is preferable in most cases. In particular, this approach simply relies on the CI of the difference between sample means (see Franz & Loftus, 2012, for a more detailed discussion of this approach and its advantages over other approaches). Such CIs are more intimately related to their between-subjects counterparts, are easily obtained with any computer program, and allow for a straightforward interpretation.
In the following section, we outline how different CIs can be computed for the common situation of comparing two sample means (cf. Table 1). These guidelines are intended to simplify graphical data presentation in a unifying framework that is intimately related to the different t-tests in classical hypothesis testing.

A CI is a CI is a CI
Independent of the underlying design, any CI for a sample mean can be broken down to a simple formula that only includes the mean itself, an appropriate SE, and a coefficient that is derived from the t-distribution. In the following, we will use 95% CIs (i.e., α = .05) in all examples because of their widespread use in the literature (cf. Equation 1):

$$\mathrm{CI} = M \pm SE \times t_{df;\,1-\alpha/2} \tag{1}$$

All CIs computed with this formula rely on the same assumptions as t-tests in classical hypothesis testing do. More precisely, they assume a normally distributed variable with an unknown population variance that is estimated from the sample (thus implying measurement at the interval scale). Furthermore, these CIs are inherently two-tailed, as reflected by the use of α/2 to determine the coefficient.
Most importantly, particular CIs differ in how the corresponding SE is computed, and the appropriate formula depends on two factors: (1) the experimental design and (2) the intended meaning of the CI.
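
As a minimal sketch of where the coefficient in Equation 1 comes from, the following R lines look up the 1 - α/2 quantile of the t-distribution for different degrees of freedom. The helper name t_coef is hypothetical and introduced only for illustration; qt and qnorm are base R functions.

```r
# Two-tailed coefficient for a (1 - alpha) CI: the 1 - alpha/2 quantile of t
alpha <- 0.05
t_coef <- function(df, alpha = 0.05) qt(1 - alpha / 2, df)  # hypothetical helper

coef_df4  <- t_coef(4)               # e.g., one sample with n = 5
coef_df8  <- t_coef(8)               # e.g., two samples with n = 5 each
coef_norm <- qnorm(1 - alpha / 2)    # large-sample limit, for comparison

print(round(c(df4 = coef_df4, df8 = coef_df8, normal = coef_norm), 3))
```

The coefficient shrinks toward the normal-distribution value as the degrees of freedom grow, which is why the same SE yields a wider CI in small samples.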
We start by discussing two CIs for between-subjects designs before continuing with within-subjects designs. Following these theoretical points, we demonstrate how to compute the three CIs for an exemplary data set.

Between-subjects designs: Two independent samples
For between-subjects designs, two distinct CIs can be computed that differ in meaning and interpretation. At first sight, the most straightforward way might be to compute separate CIs for each individual mean M by simply using the corresponding SE. In fact, this is a valid solution and we will denote the resulting SE as SE M . Following from the central limit theorem, SE M is computed by dividing the unbiased estimator of the standard deviation (s) by the square root of the sample size n (see Equation 2):

$$SE_M = \frac{s}{\sqrt{n}} \tag{2}$$

The corresponding CI for individual means is denoted as CI M (cf. Equation 3):

$$\mathrm{CI}_M = M \pm SE_M \times t_{n-1;\,1-\alpha/2} \tag{3}$$
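
Equations 2 and 3 can be sketched in a few lines of R. The data vector is the Condition 1 vector listed in Appendix A; the variable names (se_m, half_m, ci_m) are illustrative choices and not part of the article's nomenclature.

```r
cond1 <- c(7, 3, 4, 2, 5)                  # Condition 1, as listed in Appendix A

n      <- length(cond1)
m      <- mean(cond1)
se_m   <- sd(cond1) / sqrt(n)              # Equation 2: SE M = s / sqrt(n)
half_m <- se_m * qt(1 - 0.05 / 2, n - 1)   # SE M times the t coefficient
ci_m   <- c(lower = m - half_m, upper = m + half_m)  # Equation 3

print(round(c(M = m, SE_M = se_m, half_width = half_m), 3))
print(round(ci_m, 3))
```

The half-width computed here is the length of one error bar, that is, the quantity that is added to and subtracted from the mean.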

Table 1.

Parameter: A parameter is a fixed, but unknown population value. Sample statistics are used to estimate parameters.

Standard error (SE): Measure for the standard deviation of a parameter estimator. In case of a sample mean, it is equal to the estimated standard deviation divided by the square root of the underlying sample size.

Confidence interval (CI): An estimate for plausible population parameters. Several different CIs can be constructed for the comparison of two means, depending on the employed design and the desired interpretation. Still, each CI can be broken down to the simple formula: "Mean ± Standard Error × Coefficient" ($\mathrm{CI} = M \pm SE \times t_{df;\,1-\alpha/2}$).

Confidence interval for an individual mean (CI M ): This CI is constructed from the standard error of the mean (SE M ) and can be used to compare this mean to any fixed parameter. It corresponds to a one-sample t-test and does not yield any precise information about the difference between two sample means.

Confidence interval for the difference between two means from independent samples (CI D ): This CI is constructed from the between-subjects standard error of the difference between two means (SE D ). It thus corresponds to a t-test for independent samples and can be used for inferences about the difference between both means.

Confidence interval for the paired difference between two means (CI PD ): This CI is constructed from the standard error of the difference between two dependent sample means (paired differences). It is thus applicable for within-subjects designs and equivalent to a paired-samples t-test.
The corresponding CI for the difference between two independent means is denoted as CI D (cf. Equation 5):

$$\mathrm{CI}_D = (M_1 - M_2) \pm SE_D \times t_{n_1+n_2-2;\,1-\alpha/2} \tag{5}$$

In addition to the general assumptions mentioned above, the CI D assumes that the standard deviations are estimated from independent samples and that the size of these standard deviations is comparable (i.e., we assume homogeneity of variance; see Footnote 1). Importantly, conclusions based on the CI D are valid only for the difference between the means, and the CI D thus corresponds to the t-test for two independent samples.
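
The following R sketch computes a CI D by hand. It assumes that SE D takes the usual pooled-variance form that underlies the t-test for independent samples with homogeneous variances; this form is an assumption of the sketch. Note also that the values in cond2 are hypothetical and chosen only for illustration; they are not the article's Table 2 data.

```r
cond1 <- c(7, 3, 4, 2, 5)   # Condition 1, as listed in Appendix A
cond2 <- c(8, 4, 6, 3, 6)   # HYPOTHETICAL values, not the article's Table 2

n1 <- length(cond1)
n2 <- length(cond2)

# Pooled variance estimate (assumes homogeneity of variance)
s2_pooled <- ((n1 - 1) * var(cond1) + (n2 - 1) * var(cond2)) / (n1 + n2 - 2)

se_d   <- sqrt(s2_pooled * (1 / n1 + 1 / n2))   # assumed pooled form of SE D
half_d <- se_d * qt(1 - 0.05 / 2, n1 + n2 - 2)  # coefficient with n1 + n2 - 2 df
diff_m <- mean(cond1) - mean(cond2)
ci_d   <- diff_m + c(-1, 1) * half_d            # CI D around the mean difference

print(round(c(SE_D = se_d, half_width = half_d), 3))
print(round(ci_d, 3))
```

For these illustrative numbers the interval around the mean difference includes zero, which is the CI counterpart of a nonsignificant t-test for independent samples.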

Within-subjects designs: Two paired samples
For within-subjects designs, matters seem to be more complicated at first sight. In fact, Cumming and Finch (2005) recommended: "For paired data, interpret the mean of the differences and error bars for this mean. In general, beware of bars on separate means for a repeated-measure independent variable: They are irrelevant for inferences about differences" (p. 180).
Caution is indeed necessary in this situation, because the often-used CI M is unrelated to the within-subjects difference. Yet, several CIs for within-subjects designs have been proposed in the last decades (Cousineau, 2005; Jarmasz & Hollands, 2009; Loftus & Masson, 1994), with the most prevalent variant being the one of Loftus and Masson. These CIs are typically derived from the error term of the repeated-measures ANOVA, and we will come back to these methods in the section What to do with more complex designs. For comparing the means of two paired samples, however, a straightforward and elegant solution seems to be more closely related to the meaning and interpretation of CIs for between-subjects designs (see also Franz & Loftus, 2012). This solution simply uses the standard error of the paired differences (SE PD ) to construct the CI. Accordingly, it does not require any ANOVA statistics, but can be computed easily from the standard deviation of the individual difference scores s d (see Equation 6):

$$SE_{PD} = \frac{s_d}{\sqrt{n}} \tag{6}$$

The corresponding CI is labelled CI PD (following Franz & Loftus, 2012; cf. Equation 7):

$$\mathrm{CI}_{PD} = (M_1 - M_2) \pm SE_{PD} \times t_{n-1;\,1-\alpha/2} \tag{7}$$

The CI PD is thus equivalent to the confidence interval of the difference between both paired means and corresponds directly to a paired-samples t-test. When the CI PD is plotted around the actual sample means, the paired-samples t-test is significant if one mean is not part of the CI PD around the other mean; consequently, the CI PD is a direct within-subjects counterpart of the CI D for independent samples. Taken together, we suggest that the CI PD can be computed more easily and seems to be more closely related to interpreting the difference between two dependent means than any other solution.

Affection, pheromones, and CIs: A hands-on example
Table 2 shows the data of a fictitious (and rather arbitrary) study in which participants indicated their affection for the experimenter on a rating scale. This scale ranges from -10 (dislike) to 0 (neutral) to 10 (affection). Condition 1 is a control condition without any specific treatment, whereas the experimenter used a healthy dose of pheromones in Condition 2.
Different CIs are possible in this situation, depending on the actual design and the CI's intended meaning. The most important question, of course, relates to the design: Different CIs are appropriate depending on whether the data result from a between-subjects design (different participants contributed to Condition 1 and Condition 2, respectively) or a within-subjects design (the data in each row belong to a single participant). The three different CIs described above are plotted in Figure 1 and will be discussed in the following (see Appendix A for a short tutorial on how to compute these intervals with common computer programs).

Between-subjects: CIs for individual means
Confidence intervals for individual means can be computed easily based on the two standard deviations in Table 2. Accordingly, the two SEs amount to the following values (Equation 8): The two CI M indicate that the mean affection ratings are significantly different from zero for both conditions, that is, participants were positively biased toward the experimenter even when not affected by pheromones. Importantly, however, the CI M are not informative for the difference between the affection rating of the control participants and the participants who were exposed to pheromones.
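
The correspondence between the CI M and the one-sample t-test can be checked directly. The sketch below uses only the Condition 1 vector listed in Appendix A and compares two statements: whether zero lies outside the 95% CI M , and whether the one-sample t-test against zero is significant at α = .05. The variable names are illustrative.

```r
cond1 <- c(7, 3, 4, 2, 5)          # Condition 1, as listed in Appendix A

res  <- t.test(cond1, mu = 0)      # one-sample t-test against zero
ci_m <- as.numeric(res$conf.int)   # lower and upper boundary of the 95% CI M

excludes_zero <- ci_m[1] > 0 || ci_m[2] < 0   # zero outside the interval?
significant   <- res$p.value < 0.05           # test significant at alpha = .05?

print(c(excludes_zero = excludes_zero, significant = significant))
```

Both statements necessarily agree, because the interval and the test are built from the same SE M and the same t coefficient.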

Between-subjects: CIs for the difference
The SE D is equivalent to the SE that is used for the t-test for independent samples.

Figure 1. Three different confidence intervals (CIs) for two sample means. The raw data are plotted in the center of the figure; dots represent individual data points (five observations per mean; see also Table 2). Panels A and B show CIs that are appropriate for between-subjects designs; Panel C shows a CI that is appropriate for within-subjects designs (pairs of values are indicated by dashed lines in the raw data). Panel A. CIs for individual means (CI M ) rely on the standard error (SE) of the corresponding mean. The CI M indicates whether this mean is significantly different from any given (fixed) value. They do not inform about the statistical significance of the difference between the means. Panel B. CI for the difference between the means (CI D ). The means are significantly different (as judged by t-tests for independent samples) if one mean is not included in the CI D around the other mean. Panel C. Within-subjects CI, constructed from the paired difference scores (CI PD ). Two means from paired samples are significantly different (as judged by a paired-samples t-test) if one mean is not included in the CI PD around the other mean.
Table 2. Example data: Reported affection for the experimenter as indicated on a rating scale (-10 to 10).
Note. Condition 1 is a control condition without any specific treatment, whereas the experimenter had used a dose of pheromones in Condition 2.
In the following equations, we will use the indices 1 and 2 to refer to the control condition and the pheromone condition, respectively.
This CI D is plotted around each mean in Panel B of Figure 1.
The mean rating of the control participants is clearly included in the CI D around the mean of the participants who were exposed to pheromones; the two values are thus not significantly different as judged by a t-test for independent samples.

Within-subjects: CI for the difference
In contrast to the previous CIs, we now assume the data in Table 2 to result from a within-subjects design: Participants were first tested in the control condition and then exposed to the pheromones (or vice versa). Accordingly, the two ratings in each row of Table 2 are now assumed to belong to the same individual. The now appropriate CI PD is based on the pairwise difference scores for the data in Table 2. The CI PD is plotted around each mean in Panel C of Figure 1. The mean of the control condition is clearly not included in the CI PD around the mean of the pheromones condition. This is equivalent to a significant effect as revealed by a paired-samples t-test.
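
As a minimal sketch of this computation in R, the CI PD can be obtained from the pairwise difference scores alone. The cond1 vector is taken from Appendix A, whereas the cond2 values are hypothetical and serve only to illustrate the procedure; they are not the article's Table 2 data.

```r
cond1 <- c(7, 3, 4, 2, 5)   # Condition 1, as listed in Appendix A
cond2 <- c(8, 4, 6, 3, 6)   # HYPOTHETICAL paired values, not the article's Table 2

d <- cond1 - cond2          # pairwise difference scores (one per participant)
n <- length(d)

se_pd   <- sd(d) / sqrt(n)                  # SE of the paired differences
half_pd <- se_pd * qt(1 - 0.05 / 2, n - 1)  # coefficient with n - 1 df
ci_pd   <- mean(d) + c(-1, 1) * half_pd     # CI PD around the mean difference

# Plotted around each mean: does the other mean fall inside the interval?
other_mean_inside <- abs(mean(cond1) - mean(cond2)) <= half_pd

print(round(c(SE_PD = se_pd, half_width = half_pd), 3))
print(other_mean_inside)
```

With these illustrative values the other mean lies outside the interval, which corresponds to a significant paired-samples t-test, even though the same numbers treated as independent samples would not differ significantly.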

Deciding what to plot
As we have seen in the above example, different and equally possible SEs and CIs for a given situation can vary substantially and convey different information. On closer inspection, the question of which one to plot boils down to whether the difference between the two means is of major interest or not.
If the difference is indeed of interest, we suggest that each mean is best accompanied by the CI of the difference that is appropriate for the employed design (i.e., either CI D or CI PD ). As noted above, these intervals allow direct inferences about the difference and have also been labelled inferential CIs for this reason (Tryon, 2001). In addition to plotting these CIs, it is of course equally important to describe what is plotted. Here, a typical description to be used in a figure caption would be "Error bars represent the XY% confidence interval of the difference".
Alternatively, a concise description is also possible with the nomenclature suggested in this article that can be used to specify the plotted CI or SE on the axis of a graph (e.g., "RT ± SE PD " or "RT and CI PD " for a within-subjects design using response time as dependent variable).
An additional option to this approach can be used if it is only the difference that counts whereas the actual means are not of interest. In this case, it is also possible to plot only the difference itself, accompanied by the corresponding CI (i.e., CI D or CI PD ).
If the difference between the two means is not of major interest, however, we suggest plotting the CI M or SE M for each individual mean. Here, a typical description to be used in a figure caption would be "Error bars represent the XY% confidence interval of the individual means" or, to use the suggested nomenclature, a similar statement on the axis of a graph (e.g., "RT ± SE M " or "RT and CI M "). These error bars inform about the homogeneity of variance across different samples or conditions and, even though they cannot be used for inferences about the difference between two means, they provide information about the difference of each mean from a fixed parameter.
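
One possible way to draw such error bars in base R is sketched below for a within-subjects design, with the plotted CI named on the axis as suggested above. The cond2 values are hypothetical, and the layout choices (point symbols, axis limits, the data frame name bars) are illustrative assumptions rather than recommendations from the article.

```r
cond1 <- c(7, 3, 4, 2, 5)   # Condition 1, as listed in Appendix A
cond2 <- c(8, 4, 6, 3, 6)   # HYPOTHETICAL paired values, for illustration only

d       <- cond1 - cond2
half_pd <- sd(d) / sqrt(length(d)) * qt(0.975, length(d) - 1)  # CI PD half-width

means <- c(Control = mean(cond1), Pheromones = mean(cond2))
bars  <- data.frame(condition = names(means),
                    mean      = means,
                    lower     = means - half_pd,
                    upper     = means + half_pd)

# Means as points, CI PD as error bars around each mean
plot(1:2, bars$mean, xlim = c(0.5, 2.5), ylim = range(bars$lower, bars$upper),
     xaxt = "n", pch = 19, xlab = "Condition",
     ylab = "Affection rating and CI PD")
axis(1, at = 1:2, labels = bars$condition)
arrows(1:2, bars$lower, 1:2, bars$upper, angle = 90, code = 3, length = 0.05)
```

Replacing half_pd by the half-width of the CI D or the CI M (and adjusting the axis label accordingly) yields the other two variants.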

What to do with more complex designs?
The framework described in the preceding sections provides a straightforward and intuitive approach to CIs for means from two conditions for both between- and within-subjects designs. These CIs can be mapped directly to the different t-tests in classical hypothesis testing and, as mentioned above, they also rely on the same statistical assumptions as the corresponding test. The described method of plotting the appropriate CI for the difference (CI D or CI PD , respectively) can also be applied to more complex designs given that specific pairwise comparisons are crucial for the research question at hand (Franz & Loftus, 2012). If applicable, this method might indeed be the easiest and thus preferable strategy.
Still, this approach has obvious limitations regarding complex studies which include numerous conditions. In such factorial designs, CIs are typically constructed from the error term of the ANOVA omnibus test. For between-subjects designs, appropriate methods are described comprehensively in several publications (e.g., Keppel & Wickens, 2004; cf. also Estes, 1997). As noted above, different methods have also been proposed for factorial within-subjects designs (Cousineau, 2005; Jarmasz & Hollands, 2009; Loftus & Masson, 1994), with the most prevalent variant being the one of Loftus and Masson (1994; see also Baguley, 2012; Bakeman & McArthur, 1996; Masson & Loftus, 2003; Hollands & Jarmasz, 2010; Tryon, 2001). Using these methods, however, [...] Loftus and Masson (1994) are not directly equivalent to t-tests for paired samples but have to be multiplied by a fixed factor to allow for inferences about possibly significant effects (i.e., in the case of two groups/conditions: CI PD = √2 × CI Loftus & Masson ). Excellent examples on how to compute and interpret these CIs can be found in the corresponding articles.
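
The factor of √2 for two conditions can be illustrated numerically. The R sketch below assumes the usual form of the Loftus and Masson interval, in which the half-width is the square root of MS(subjects × conditions)/n multiplied by the t coefficient with (n - 1)(J - 1) degrees of freedom; this form is stated here as an assumption. The cond2 values are hypothetical.

```r
cond1 <- c(7, 3, 4, 2, 5)   # Condition 1, as listed in Appendix A
cond2 <- c(8, 4, 6, 3, 6)   # HYPOTHETICAL paired values, for illustration only

x <- cbind(cond1, cond2)    # rows = participants, columns = conditions
n <- nrow(x)
J <- ncol(x)

# Subject x condition interaction residuals of the repeated-measures ANOVA
res    <- x - rowMeans(x) - rep(colMeans(x), each = n) + mean(x)
df_sxc <- (n - 1) * (J - 1)
ms_sxc <- sum(res^2) / df_sxc

half_lm <- sqrt(ms_sxc / n) * qt(0.975, df_sxc)     # assumed Loftus & Masson form

d       <- cond1 - cond2
half_pd <- sd(d) / sqrt(n) * qt(0.975, n - 1)       # CI PD half-width

ratio <- half_pd / half_lm
print(round(c(CI_LM = half_lm, CI_PD = half_pd, ratio = ratio), 4))
```

For two conditions the interaction mean square equals half the variance of the difference scores, which is where the factor of √2 originates.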

Concluding remarks
In the preceding sections, we have summarized three approaches to CIs for one of the most common designs in psychological research, that is, the comparison of two sample means. Clearly, different CIs need to be computed for between- and within-subjects designs (cf. Blouin & Riopelle, 2005; Cumming & Finch, 2005; Estes, 1997; Loftus & Masson, 1994), and the particular CI used in a plot needs to be specified. To this end, we suggested an easy nomenclature for three different CIs to facilitate communication about what exactly a given CI represents (see Table 1). Furthermore, we argue that CIs for the difference between two means (CI D and CI PD ) are most informative in the majority of cases, because they can be interpreted intuitively. These CIs provide a straightforward approach to the described setting; more complex designs of course call for different approaches to CIs, which can be found in a variety of recent articles.

Footnotes
1 In the rare case of two equally sized samples with numerically identical standard deviations, the SE M is informative also for the difference between the means. Here, it is directly related to SE D with SE D = √2 × SE M . If sample sizes or standard deviations are (even slightly) dissimilar, however, this relation is not valid. It should also be noted that this relation is only valid for SEs but not for the corresponding CIs: The coefficient of the CI M has n -1 degrees of freedom (df) whereas the coefficient of the CI D has (n 1 + n 2 -2)df.
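
The relation described in this footnote can be verified with a short R sketch. The second sample y is a hypothetical copy of the Condition 1 vector shifted by a constant, so that both samples have the same size and numerically identical standard deviations; the pooled-variance form of SE D is assumed.

```r
x <- c(7, 3, 4, 2, 5)   # Condition 1, as listed in Appendix A
y <- x + 2              # HYPOTHETICAL second sample: same n, identical SD
n <- length(x)

se_m <- sd(x) / sqrt(n)
se_d <- sqrt(((n - 1) * var(x) + (n - 1) * var(y)) / (2 * n - 2) * (2 / n))

se_ratio <- se_d / se_m   # relation between the two SEs

# The CIs use different coefficients: n - 1 df versus n1 + n2 - 2 df
ci_ratio <- (se_d * qt(0.975, 2 * n - 2)) / (se_m * qt(0.975, n - 1))

print(round(c(se_ratio = se_ratio, ci_ratio = ci_ratio, sqrt2 = sqrt(2)), 4))
```

The SE ratio equals √2, whereas the ratio of the CI half-widths is smaller because the coefficient of the CI D is based on more degrees of freedom.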

Appendix A
In the following, we show how the different CIs can be computed by common software packages, such as SPSS, MS Excel, and R.

SPSS
CIs for individual means (CI M ) can be computed with the Explore command: Here, the menu Statistics allows setting the α level (default: 5%).
Alternatively, the CI M is also contained in the output of the one-sample t-test in a section labelled 95% Confidence Interval of the Difference (with the α level being set in the Options menu). Both ways of computing the CI result in a CI that is specified via lower and upper boundaries, which can be easily transformed to the notation that is used in this article (see Equation A1):

$$\mathrm{CI} = M \pm \frac{\mathrm{upper} - \mathrm{lower}}{2} \tag{A1}$$
CIs for the difference between independent means (CI D ) and CIs for the difference between paired means (CI PD ) can be obtained by using the same formula on the output of the t-test for independent-samples and the paired-samples t-test, respectively. These outputs also contain the corresponding values for SE D or SE PD .

Microsoft Excel
Computing CIs for individual means (CI M ) in MS Excel requires several quick steps. First, the standard deviation s is computed with the STDEV function. Dividing this value by √n returns the SE M ; this can be done by using SQRT(n), with n either entered directly or computed with the COUNT function.
Finally, the SE M is multiplied by the coefficient taken from the t-distribution, which can be accessed via the TINV function. Importantly, TINV is inherently two-tailed and takes the intended α level as input.
The CI for the difference of two independent means, CI D , is computed similarly with two changes. Most importantly, the SE D is computed using the corresponding formula in the section Between-subjects designs: Two independent samples (using SQRT for the square root and "^2" to denote exponents). Furthermore, the critical t-value has to be requested with the correct number of dfs via TINV(0.05, n 1 + n 2 -2).
In contrast to CI M and CI D , the CI for the difference between paired means, CI PD , requires an additional first step. Assuming that each condition is entered in a separate column, one first needs to compute pairwise differences (Figure A1). The estimated standard deviation of the difference scores s d can now be computed via the STDEV function.
Dividing this value by √n returns the SE PD . This can again be done by using SQRT(n). CI PD is then computed by multiplying the SE PD with the coefficient computed by TINV(0.05, n-1) for t n -1; 0.975 .

R
Confidence intervals are included in the output of the function t.test. As in SPSS, CIs are typically given in terms of lower and upper boundaries. These values can be accessed directly to arrive at the notation that is used in this article. Accessing the boundaries works similarly for all t-tests, and we will demonstrate the general procedure for the one-sample t-test and the corresponding CI M . First, we enter the data of Condition 1 as a vector and compute the one-sample t-test via t.test. The output is stored in the new variable result:

> cond1 <- c(7, 3, 4, 2, 5)
> result <- t.test(cond1)

The boundaries can now be addressed by result$conf.int, which returns a vector containing both values. The length of an individual error bar can now be computed in the following way:

> (max(result$conf.int) - min(result$conf.int)) / 2
[1] 2.388388
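
The same procedure can be sketched for the CI D and the CI PD . Two points are worth noting as assumptions of this sketch: the cond2 values are hypothetical, and t.test computes the Welch test by default, so var.equal = TRUE has to be set to obtain the pooled-variance interval that matches the homogeneity-of-variance assumption. The helper half_width is a hypothetical name introduced for illustration.

```r
cond1 <- c(7, 3, 4, 2, 5)   # Condition 1, as entered above
cond2 <- c(8, 4, 6, 3, 6)   # HYPOTHETICAL values, for illustration only

# Hypothetical helper: length of one error bar from a t.test result
half_width <- function(res) as.numeric(diff(res$conf.int)) / 2

# CI D: independent samples, pooled variance (not the default Welch test)
half_d <- half_width(t.test(cond1, cond2, var.equal = TRUE))

# CI PD: paired samples
half_pd <- half_width(t.test(cond1, cond2, paired = TRUE))

print(round(c(CI_D = half_d, CI_PD = half_pd), 3))
```

Both values are half-widths, so they can be plotted directly as error bars around each mean.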